ACOUSTIC BEAM FORMING WITH ROBUST SIGNAL ESTIMATION



BACKGROUND OF THE INVENTION 

Field of the Invention 

The present invention relates to audio signal processing, and, in particular, to acoustic beam
forming with an array of microphones.

Description of the Related Art 

Microphone arrays can be focused onto a volume of space by appropriately scaling and delaying
the signals from the microphones, and then linearly combining the signals from each microphone. As a
result, signals from the focal volume add, and signals from elsewhere (i.e., outside the focal volume) tend
to cancel out.

One of the problems with a simple linear combination of signals is that it does not address the
situation when noise occurs at or near one of the microphones in the array. In a simple linear
combination of signals, such noise appears in the resulting combined signal.

There is prior art for canceling noise sources whose positions are known, such as those based on
radar jamming countermeasures, where the delays and scales of the different microphones are adjusted to
produce a null at the known position of the noise source. These techniques are not applicable if the
position of the noise source is not well known, or if the noise is generated over a relatively large region
(e.g., larger than a quarter wavelength across), or in a strongly reverberant environment where there are
many echoes of the noise source.

Other prior art techniques for noise suppression, such as spectral subtraction techniques, operate
in the frequency domain to attenuate the signal at frequencies where the signal-to-noise ratio is low. In
the context of acoustic beam forming, such techniques would be applied independently to individual
audio signals, either before the signals from the different microphones are combined or, after that
combination, to the single resulting combined signal.

SUMMARY OF THE INVENTION

The present invention is directed to a technique for noise suppression during acoustic beam
forming with microphone arrays when the location of the noise source is unknown and/or the frequency
characteristics of the noise are not known. According to the present invention, noise suppression is
achieved by combining the audio signals from the various microphones in an appropriate nonlinear
manner.

In one implementation of the present invention, the individual microphone signals are filtered
(e.g., shifted and scaled), but, instead of simply adding them as in the prior art, a sample-by-sample
median is taken across the different microphone signals. Since the median has the property of ignoring
outlying data, large extraneous signals that appear on less than half of the microphones are ignored.

Other implementations of the present invention use a robust signal estimator intermediate
between a median and a mean. A representative example is a trimmed mean, where some of the highest
and lowest samples are excluded before taking the mean of the remaining samples. Such an estimator
will yield better rejection of sound originating outside the focal volume. It will also yield lower
harmonic distortion of such sound.

The present invention is computationally inexpensive, and does not require knowledge of the
position of the noise source. It works well on noise sources that are spread out over regions
small compared to the array size. It also has the added benefit of rejecting impulse noise at high
frequencies, even from sources that are not near a microphone.

Another advantage over the prior art is that the resultant signal from the present invention can be 
much less reverberant than can be produced by any prior art linear signal processing technique. In many 
rooms, sound waves will reflect many times off the walls, and thus each microphone picks up delayed 
echoes of the source. The present invention suppresses these echoes, as the echoes tend not to appear 
simultaneously in all microphones. 

In one embodiment, the present invention is a method for processing audio signals generated by
an array of two or more microphones, comprising the steps of (a) filtering the audio signal from each
microphone to generate a processed audio signal for each microphone and combining the processed audio
signals to form an acoustic beam that focuses the array on one or more desired regions in
space; and (b) performing nonlinear signal estimation processing on the processed audio signals from the
microphones to generate an output signal for the array, wherein the nonlinear signal estimation
processing discriminates against noise originating at an unknown location outside of the one or more
desired regions, where the term "noise" can be read to include delayed reflections of the original signal
(i.e., reverberations).

BRIEF DESCRIPTION OF THE DRAWINGS 

Other aspects, features, and advantages of the present invention will become more fully apparent
from the following detailed description, the appended claims, and the accompanying drawings in which: 

Fig. 1 shows a block diagram of audio signal processing performed to implement dynamic 
acoustic beam forming for an array of N microphones, according to one embodiment of the present 
invention; and 
Figs. 2-6 show results of simulations comparing a system having a robust signal estimator of the
present invention with a system utilizing a prior-art linear combination of microphone signals. 

DETAILED DESCRIPTION 

Fig. 1 shows a block diagram of audio signal processing performed to implement dynamic 
acoustic beam forming for an array of N microphones, according to one embodiment of the present
invention. As used in this specification, the term "acoustic signal" refers to the air vibrations 
corresponding to actual sounds, while the term "audio signal" refers to the electrical signal generated by 
a microphone in response to a received acoustic signal. 

As shown in Fig. 1, the audio signal generated by each microphone is independently subjected to 
a processing channel comprising the steps of input filtering 102, intermediate filtering 104, and pre- 
emphasis filtering 106. Input filtering 102, which is preferably digital filtering, matches the frequency 
response of the corresponding combined microphone-filter system to a desired standard. In one 
embodiment, intermediate filtering 104 comprises delay and scaling filtering that delays and scales the 
corresponding digitally filtered audio signal so that, when the different audio signals are eventually 
combined (during robust signal estimation 108), they will form the desired acoustic beam. According to 
the present invention, an acoustic beam results from an array of two or more microphones, whose 
effective combined response is focused on one or more desired three-dimensional regions of space within 
a particular volume (e.g., a room). 

In addition to or instead of delay and scaling, intermediate filtering 104 may contain a digital 
filter (e.g., a finite impulse response (FIR) filter). In one embodiment, where the system is used to reduce 
room reverberations, intermediate filtering 104 provides an approximate inverse to the room's transfer 
function. Although shown in Fig. 1 as separate elements, in other implementations, input filtering 102 
and intermediate filtering 104 may be combined. In a preferred embodiment, after intermediate filtering 
104, each audio signal is subjected to identical pre-emphasis filtering 106. 

After pre-emphasis filtering 106, the N processed audio signals from the N microphones are
combined according to a robust signal estimator 108, and the resulting combined audio signal is
subjected to output (e.g., de-emphasis) filtering 110 to generate the output signal. Robust signal
estimation 108 is described in further detail later in this specification. Output filtering 110, which may
be implemented using a Wiener filter, is applied to shape the output spectrum and improve the overall
signal-to-noise ratio.

As shown in Fig. 1, the audio signal processing provides dynamic control over the acoustic beam 
steering implemented by the N intermediate filtering steps 104. In particular, dynamic steering control 
112 receives the outputs from the N input filtering steps 102 (or, alternatively, the outputs from the N
pre-emphasis filtering steps 106) as well as the final output signal from robust signal estimator 108 (or, 
alternatively, the output signal from output filtering 110) and generates control signals that dictate the 
amounts of delay and scaling for the N intermediate filtering steps 104. In a preferred embodiment, 
dynamic steering control 112 attempts to adjust each intermediate filter 104 such that the output from the 
corresponding pre-emphasis filter 106 matches (in both amplitude and phase) the output signal generated 
by output filter 110. 

In addition, the audio signal processing of Fig. 1 provides dynamic control over the combining of 
audio signals implemented by robust signal estimation step 108. In particular, signal analysis 114 
performs statistical analysis on the outputs from pre-emphasis filters 106 and the output signal from 
robust signal estimator 108 (or, alternatively, the output signal from output filtering 110) to generate 
statistical measures (e.g., the variance of the differences between the N inputs to robust signal estimator 
108 and the output from robust signal estimator 108) used by dynamic estimation control 116 to 
dynamically control the operations of robust signal estimation 108. For example, when robust signal 
estimator 108 performs a weighted combination of audio signals, dynamic estimation control 116 
dynamically adjusts the different weights applied by robust signal estimator 108 to the different audio 
signals from different microphones. 

Note that the thick arrows in Fig. 1 flowing (1) from the column of input filters 102 to dynamic 
steering control 112, (2) from dynamic steering control 112 to the column of intermediate filters 104, and 
(3) from the column of pre-emphasis filters 106 to signal analysis 114 are intended to indicate that 
signals are flowing from all N of the input filters 102, to all N of the intermediate filters 104, and from all 
N of the pre-emphasis filters 106, respectively. 

Either or both of the feedback loops in Fig. 1 may be omitted for particular embodiments that do 
not provide the corresponding type(s) of dynamic control over the audio signal processing. 

The audio signal processing of Fig. 1, which uses a nonlinear operator to combine the various 
input signals, can be implemented in a low-delay pipelined manner. The combination step of robust 
signal estimation 108 preferably operates on a single sample (from each microphone), so the whole 
system can operate with delays much smaller than techniques that require a buffer to be accumulated and 
a transform (e.g., FFT) performed on the buffer. The output signal bears a definite phase relationship to 
the input signal, unlike many spectral subtraction techniques. 
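
The sample-at-a-time character of this processing can be illustrated with a short sketch. The Python generator below is only an illustration under stated assumptions: the name `stream_beamformer`, the restriction to whole-sample delays, and the use of a plain median as the combining step are choices made here for brevity and are not details taken from Fig. 1.

```python
from collections import deque
from statistics import median

def stream_beamformer(frames, delays, gains, combine=median):
    """Yield one output sample for each incoming frame of N microphone samples.

    frames: iterable of N-tuples (one sample per microphone per time step).
    delays: per-microphone delay in whole samples (assumption: no fractional delays).
    gains:  per-microphone scale factors.
    """
    # One short delay line per microphone; line[0] is the sample from d steps ago.
    lines = [deque([0.0] * d, maxlen=d + 1) for d in delays]
    for frame in frames:
        aligned = []
        for line, gain, sample in zip(lines, gains, frame):
            line.append(sample)
            aligned.append(gain * line[0])
        # Nonlinear combination of the current aligned samples only: no block buffer.
        yield combine(aligned)

# A pulse reaches microphone 2 first, then 1, then 0; microphone 1 also picks up
# an unrelated spike one step later.
frames = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 9, 0), (0, 0, 0)]
output = list(stream_beamformer(frames, delays=[0, 1, 2], gains=[1.0, 1.0, 1.0]))
```

Each output sample depends only on the delay lines, so the latency is set by the longest steering delay rather than by a transform block length.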

Robust Signal Estimation

Robust signal estimation 108 of Fig. 1 may be implemented in a variety of different ways that
share the following similar nonlinear concept: each implementation picks a representative, central value
from a collection of inputs by dropping or altering extreme data, such that the resulting central estimate is
robust against (i.e., relatively insensitive to) wild variations of one input or possibly even a few inputs.
With robust signal estimation according to the present invention, any one input value can vary from
positive infinity to negative infinity without affecting the resulting output by more than a relatively small,
finite amount.

One type of robust signal estimation is based on the median. In a median estimator, the
individual microphone signals are individually filtered, shifted, and scaled, as indicated by the N parallel
processing paths in Fig. 1, but, instead of being simply added as in prior-art techniques that rely on a
linear combination of signals, the audio signals are "combined" in a nonlinear manner by taking the
sample-by-sample median across the different microphone signals. In other words, at any given time, the
output signal is selected as the median of the current values for the signals from the N microphones.
Since the median has the property of ignoring outlying data, large extraneous signals that appear on less
than half of the microphones will be effectively ignored.
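
A minimal sketch of the median estimator is given below, assuming the channels have already been filtered, delayed, and scaled so that the desired signal is aligned across microphones. The function name `median_combine` and the list-of-lists input format are illustrative choices, not part of the specification.

```python
from statistics import median

def median_combine(channels):
    """Sample-by-sample median across N aligned microphone channels.

    channels: list of N equal-length lists, one per microphone (assumed format).
    """
    return [median(samples) for samples in zip(*channels)]

# Five microphones carry the same signal; microphone 2 also has a loud local click.
signal = [0.0, 1.0, 0.5, -0.5, -1.0]
mics = [list(signal) for _ in range(5)]
mics[2][3] += 50.0

combined = median_combine(mics)                    # click is ignored
linear = [sum(s) / len(s) for s in zip(*mics)]     # prior-art style average
```

In this example the click appears on only one of five microphones, so the median output equals the clean signal, while the linear average is pulled far from it at the affected sample.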

Another type of robust signal estimation is based on a trimmed mean, where, for each set of
current input values for the N microphones, one or more of both the highest and lowest input values are
dropped, and the output is then generated as the mean of the remaining values. A trimmed mean
estimator combines features of both a median (e.g., dropping the highest and lowest values) and a mean
(e.g., averaging the remaining values). With large arrays (e.g., 10 or more microphones), it may be
advantageous to trim more than one datum on each end.
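
The trimmed mean can be sketched as follows. The function name `trimmed_mean` and the parameter `trim` (the number of values dropped from each end) are hypothetical names introduced for this illustration.

```python
def trimmed_mean(values, trim=1):
    """Mean of `values` after dropping the `trim` highest and `trim` lowest."""
    if len(values) <= 2 * trim:
        raise ValueError("need more than 2*trim input values")
    kept = sorted(values)[trim:len(values) - trim]
    return sum(kept) / len(kept)

def trimmed_mean_combine(channels, trim=1):
    """Apply the trimmed mean sample by sample across aligned channels."""
    return [trimmed_mean(list(samples), trim) for samples in zip(*channels)]

# One wild input changes the result by only a bounded amount.
mild = trimmed_mean([0.9, 1.0, 1.1, 1.0, 40.0])
wild = trimmed_mean([0.9, 1.0, 1.1, 1.0, 1e9])
```

Both calls drop the same extreme input, so `mild` and `wild` are identical: once a value is the largest of the set, making it larger has no further effect.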

Another type of robust signal estimation is based on a weighted, trimmed mean, where, for each
set of current input values for the N microphones, after one or more of the highest and lowest input
values are dropped (as in the trimmed mean), one or more of the remaining highest and lowest input
values (or even as many as all of the remaining inputs) are weighted by specified factors $w_i$ having
magnitudes less than 1 to reduce the impact of those inputs when subsequently generating the output as
the mean of the remaining weighted values.
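
One possible reading of the weighted, trimmed mean is sketched below. Several details are assumptions made for this illustration only: the weights are assigned by rank (the `edge` outermost surviving values on each end receive `edge_weight`), and the result is normalized by the sum of the weights; the specification does not fix either choice.

```python
def weighted_trimmed_mean(values, trim=1, edge=1, edge_weight=0.5):
    """Drop `trim` values from each end, then down-weight the `edge` outermost
    survivors on each end before averaging (normalized by the total weight)."""
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim]
    if len(kept) < max(1, 2 * edge):
        raise ValueError("too few values left after trimming")
    weights = [1.0] * len(kept)
    for k in range(edge):
        weights[k] = edge_weight          # low end
        weights[-1 - k] = edge_weight     # high end
    return sum(w * v for w, v in zip(weights, kept)) / sum(weights)

result = weighted_trimmed_mean([0, 1, 2, 3, 4, 5, 100])
```

Here the values 0 and 100 are dropped, the survivors 1 and 5 are given half weight, and the weighted average of the remaining five values is 3.0.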

Trimmed mean and weighted trimmed mean estimators, which are intermediate between a
median and a mean, tend to yield both less distortion for, and better rejection of, sound originating
outside the focal volume.

Another type of robust signal estimation is based on a Winsorized mean, which is calculated by
adjusting the value of the highest datum down to match the next-highest, adjusting the lowest datum up
to match the next-lowest, and then averaging the adjusted points. As long as the second-highest and
second-lowest points are reasonable, the extreme points can vary wildly, with little effect on the central
estimate. With large arrays (e.g., ten or more microphones), it may be advantageous to "Winsorize"
(adjust) more than one datum on each end.
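
A short sketch of the Winsorized mean follows. The function name `winsorized_mean` and the parameter `k` (the number of data adjusted on each end) are illustrative.

```python
def winsorized_mean(values, k=1):
    """Clamp the k lowest and k highest values to their nearest inner
    neighbours, then average all n values."""
    ordered = sorted(values)
    n = len(ordered)
    if n <= 2 * k:
        raise ValueError("need more than 2*k input values")
    low, high = ordered[k], ordered[n - 1 - k]
    clipped = [min(max(v, low), high) for v in ordered]
    return sum(clipped) / n

estimate = winsorized_mean([1, 2, 3, 4, 100])   # 100 is pulled down to 4, 1 up to 2
```

Unlike the trimmed mean, every input still contributes one term to the average; only the magnitude of the extreme terms is limited.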

The different types of robust signal estimation described so far treat each set of input values
independently. In other words, there is no filtering or integration that occurs over time. In alternative
embodiments, the various types of robust signal estimation can be modified to use multiple samples from
each microphone, either averaging over time or performing some other suitable type of temporal filtering.
For example, a median-like operator can be implemented based on an arbitrary distance measure, which
can be based on multiple samples for each microphone. For instance, the distance between two
sequences can be defined to be a perceptually weighted distance, perhaps obtained by subtracting the
sequences, convolving with a kernel, and squaring. At each sample, the microphone that "sounds" most
typical can be identified and the output can then be selected as the signal from that microphone. The
most-typical microphone could be defined as the one with the smallest sum of differences with respect to
the other microphones, or using other techniques specially designed to exclude outliers.
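
The "most typical microphone" selection can be sketched as below. The smoothing kernel `(0.25, 0.5, 0.25)` is a placeholder standing in for a perceptual weighting, and the function names are hypothetical; the sketch only illustrates the subtract, convolve, and square distance and the smallest-sum selection described above.

```python
def convolve(x, kernel):
    """Full linear convolution of two short sequences."""
    out = [0.0] * (len(x) + len(kernel) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(kernel):
            out[i + k] += xi * hk
    return out

def distance(a, b, kernel):
    """Subtract the sequences, convolve with the kernel, square, and sum."""
    diff = [p - q for p, q in zip(a, b)]
    return sum(v * v for v in convolve(diff, kernel))

def most_typical(windows, kernel=(0.25, 0.5, 0.25)):
    """Index of the microphone whose recent samples are closest to all others."""
    totals = [sum(distance(w, other, kernel) for other in windows) for w in windows]
    return min(range(len(windows)), key=totals.__getitem__)

# Recent samples from four microphones; the last one is corrupted.
windows = [[1.0, 2.0, 3.0], [1.1, 2.0, 3.0], [1.0, 2.0, 2.9], [9.0, 9.0, 9.0]]
chosen = most_typical(windows)
output_sample = windows[chosen][-1]
```

The output at each step is then the current sample of the selected microphone, so the corrupted channel is never chosen while a majority of channels agree.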

Another implementation would be to use a single-sample estimator as described above, but
dynamically change the weights given to each microphone, e.g., based on the ratio of power in the speech
band to the power outside that band. This dynamic approach can be implemented using the signal
analysis 114 and dynamic estimation control 116 modules shown in Fig. 1.

In one sample implementation optimized for processing human speech, signal analysis 114 could
calculate the amount of power output at each pre-emphasis filter 106 that is (1) coherent with the output
of robust signal estimator 108 and (2) within a frequency band that contains most speech information
(e.g., from about 100 Hz to about 3 kHz). It could also calculate the total power output from each of pre-
emphasis filters 106. Dynamic estimation control 116 could then set the weight for each input to robust
signal estimator 108 to be the ratio of the first power to the total power for that channel. Speech-like
signals would then be given more weight. Likewise, signals that agree with the output of robust signal
estimator 108 (and thus agree with each other) would also be weighted more heavily.

Setup 

As suggested by the previous discussion of Fig. 1, before the audio signal processing algorithm is
applied, the frequency response and phase delay of each microphone are measured. For each
microphone, the corresponding input filter 102 is then set to match the frequency response of the
combined microphone-filter system to a desired standard. The standard frequency response is typically
set to be substantially flat between 100 and 10,000 Hz.

For a given source position (i.e., the desired acoustic beam focal point), the time delays and
scaling levels for step 104 are then generated in order to match the phases and amplitudes of the audio 
signal in each channel. To get good noise rejection, the N scaling levels should be chosen so that, after 
the scaling of step 104, the audio signals will have the same magnitude in each channel. 

Consider, for example, a trimmed mean estimator that drops the highest and lowest values, and
then averages the rest. The noise suppression results from dropping the extreme points. Like many
robust estimators, a trimmed mean estimator has the property that any single input value can vary from
positive infinity to negative infinity, and yet change the resulting output by only a finite amount. The
majority of this change typically occurs when a given input, e.g., input j, is within
$\Delta v_j \approx (\mathrm{var}\{v_i;\ i \neq j\})^{1/2}$ of the mean of $\{v_i;\ i \neq j\}$, where $v_i$ is
the voltage on the ith input.

To get good noise rejection, the scaling levels should be chosen such that the resulting signals in
the different channels have the same magnitude after intermediate filtering 104. This can be seen by
considering the trimmed mean. The noise suppression results from dropping the extreme samples. If the
input values to the robust estimator are widely spread (i.e., $\Delta v_j$ is large), then a noise signal on some
channel must reach a relatively large amplitude before it becomes large enough to be dropped. To
minimize the spread $\Delta v_j$ of the non-noisy input values, the amplitudes and phases of the signals input to
robust signal estimation 108 are matched. Since the amplitudes are constrained to match each other,
weights are introduced, which will allow some data to be marked as unimportant or noisy. These weights
may be used by the robust estimator step.

In addition, it is desirable to minimize the generation of intermodulation distortion products in
the robust estimator module. These products arise from the nonlinear nature of the robust estimator, and,
for uncorrelated inputs, typically have amplitudes on the order of
$\Delta V \approx (\mathrm{var}\{v_j\})^{1/2}/N$, where N is the number of input values. Again, this can be
made small by matching the input voltages, but it can also be reduced by using a larger microphone
array, thereby increasing N.

In a case where room reverberation is unimportant, the microphones are in the far field, and the
dominant sound propagation is a direct path through free space. The desired time delays for filters 104
are then $t_i = (\max\{d_i\} - d_i)/c$, and the desired microphone gains for filters 104 are proportional to
$d_i$, where $d_i$ is the distance from the source to the ith microphone, and c is the speed of sound. These
choices work adequately in normally reverberant rooms, though the rejection of interfering signals will
not be optimal, and some extra intermodulation distortion will be introduced.

IDS#119I67 (990.0234) -7- Kochanski 52-16
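
For the free-space case, the delays and gains for filters 104 follow directly from the geometry. The sketch below is illustrative only: the function name `free_space_steering`, the coordinate format, the value used for the speed of sound, and the normalization of the gains (largest gain equal to 1) are assumptions, since only the proportionality of the gains to distance is specified.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def free_space_steering(mic_positions, source):
    """Return (delays in seconds, gains) that align a free-space source.

    Delays are (max distance - distance) / c and gains are proportional to distance.
    """
    dists = [math.dist(m, source) for m in mic_positions]
    d_max = max(dists)
    delays = [(d_max - d) / SPEED_OF_SOUND for d in dists]
    gains = [d / d_max for d in dists]   # arbitrary overall normalization
    return delays, gains

mics = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
source = (0.0, 4.0, 0.0)                 # 4 m from the first mic, 5 m from the second
delays, gains = free_space_steering(mics, source)
```

With these settings the propagation time plus the added delay is the same for every channel, and the gain cancels the 1/distance amplitude falloff, so the desired signal arrives at the robust estimator matched in both phase and amplitude.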



In a more realistic system where echoes and other effects are important, or where higher quality 
sound is required, the delays and scalings would be generalized into full digital filters. For noise 
suppression, those filters are preferably chosen based on two criteria. 

First, the desired signal (i.e., a signal from the focal volume) should appear nearly identical at the 
outputs of all of the intermediate filters 104. Any mismatch between the signals will both (1) increase 
the trimming threshold of the robust estimator 108, making the system more sensitive to unwanted 
signals and (2) introduce intermodulation distortion products into the output signal. 

Second, the intermediate filters 104 should be chosen to have a compact impulse response in the 
time domain. As the filter's impulse response becomes longer, the energy of rogue signals (i.e., signals 
not from the focal volume) will be spread over more samples. As a result, they will not be trimmed as 
effectively by the robust estimator. 

Generally, these criteria cannot be satisfied simultaneously, and a design will involve careful 
tradeoffs between the constraints, which conflict when the room's impulse response becomes long. Since 
the room's impulse response will vary from one microphone to another, exact matching of the desired 
signal on different channels would require digital filters whose impulse response is as long as the room's 
reverberation time. On the other hand, the rogue signals that are most easily rejected come from close to 
one microphone or another. In those cases, the room reverberation is relatively unimportant, since the 
rogue signals predominantly come on the direct path, not via reflections. Processing these rogue signals 
through a set of filters that is adjusted to match signals from the focal volume will generally spread the 
rogue signals and reduce their peak amplitude, so that they will not be cleanly trimmed away. For noise 
suppression, one needs to choose these matching filters to be a compromise between accurate matching 
of the desired signal and excessive broadening of rogue signals. On the other hand, a room de- 
reverberation application puts strong emphasis on matching the signals from the focal volume, and little 
or no emphasis on rejection of rogue signals that originate near a microphone. 

For noise suppression, filters that make a good compromise can be calculated by minimizing the
energy functional $\beta$ over the space of all filters. The energy functional $\beta$ measures the energy of
rogue signals that can pass through the robust estimator, for a fixed sensitivity to signals that originate in
the focal volume. Specifically, each microphone is imaginarily probed with a set of test signals
$p_\alpha(\omega)$, whose peak amplitudes are adjusted to just match the estimator's trimming threshold. The
energy coming out of the system is measured and then averaged over all microphones and all test signals.

In the case of a trimmed mean as a robust point estimator, the energy functional $\beta$ is given by
Equation (1) as follows:

[Equation (1) not legible in source; the surviving fragments show a sum over $\alpha$ and $j$ and a factor involving $T/\hat{p}_{\alpha j}$]   (1)

where $p_\alpha(\omega)$ is the probe pulse, $\alpha$ selects which of the test signals is applied, $A_j(\omega)$ is the
gain of the jth channel input amplifier 104 and filter 106, $w_j$ is the weight given to the jth channel in the
trimmed mean (under the constraint $\sum_j w_j = 1$), and $T$ is the trimming threshold. The peak amplitude
of the probe pulse, after the amplifiers and filters, is given by Equation (2) as follows:

$$\hat{p}_{\alpha j} = \max_t \left| \int p_\alpha(\omega) A_j(\omega) e^{i\omega t} \, d\omega \right|. \qquad (2)$$

As such, $T/\hat{p}_{\alpha j}$ is the factor by which the probe pulse should be scaled to just reach the robust
estimator's trimming threshold. The requirement for fixed sensitivity in the focal volume is given by
Equation (3) as follows:

[Equation (3) not legible in source]   (3)

where $H_j^d(\omega)$ is the transfer function for sound propagating from the desired source to the jth
microphone. The constraint of Equation (3) has been assumed to eliminate the degeneracy of the
solution for $\{w_j\}$. Relaxing this constraint applies an overall multiplier to the output signal.

The trimming threshold $T$ should be calculated in the presence of a typical signal and a typical
noise environment. The signal $s(\omega)$ from the focal volume (i.e., the desired signal) and noise $N_j(\omega)$
can be approximated by stationary random processes. It is also assumed that the noise is not correlated
between microphones. This assumption of uncorrelated noise becomes invalid for small arrays at low
frequencies, and will limit the applicability of this analysis for noisy rooms. It is further assumed that the
trimmed mean is only lightly trimmed, so that the untrimmed mean is a good first estimate for the
trimmed mean. Since the untrimmed mean is $s(\omega)$, the deviations from the untrimmed mean can be
expressed by Equation (4) as follows:

$$V_j(\omega) = N_j(\omega) A_j(\omega) w_j + s(\omega) \left[ H_j^d(\omega) A_j(\omega) - 1 \right] w_j, \qquad (4)$$

in order to calculate Equation (5) as follows:

$$\mathrm{var}\{v_j\} = \mathrm{var}\{V_j\} = \sum_j w_j^2 \int \left( |N_j(\omega) A_j(\omega)|^2 + |s(\omega)|^2 \, |H_j^d(\omega) A_j(\omega) - 1|^2 \right) d\omega. \qquad (5)$$

From there, it is assumed that $v_j$ has a reasonably Gaussian probability distribution. This condition is met
if the signals are approximately Gaussian and their amplitudes are approximately equal. As such, the
trimming threshold can be solved using Equation (6) as follows:

$$\mathrm{erf}\left( T/(\mathrm{var}\{v_j\})^{1/2} \right) = 1 - 2M/N, \qquad (6)$$

which corresponds to trimming M microphones off each end of the probability distribution. Note that T 
is really a time-varying quantity, especially in a system with only a few microphones, and an 
approximation is made by giving it a single, constant value. 
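
Equation (6) can be solved for T numerically. The sketch below follows the form of Equation (6) as written, with the argument of erf taken as T divided by the standard deviation. The helper `erfinv`, built from the standard normal quantile function, and the function name `trimming_threshold` are introduced here for illustration.

```python
import math
from statistics import NormalDist

def erfinv(y):
    """Inverse error function for -1 < y < 1, via the standard normal quantile."""
    return NormalDist().inv_cdf((1.0 + y) / 2.0) / math.sqrt(2.0)

def trimming_threshold(variance, n_mics, m_trimmed):
    """Solve erf(T / sqrt(variance)) = 1 - 2M/N for T (requires 0 < M < N)."""
    y = 1.0 - 2.0 * m_trimmed / n_mics
    return math.sqrt(variance) * erfinv(y)

# Example: 20 microphones, one trimmed from each end on average, input variance 4.0.
T = trimming_threshold(variance=4.0, n_mics=20, m_trimmed=1)
```

As M increases toward N/2 the right-hand side of Equation (6) approaches zero, so the computed threshold shrinks toward zero and the estimator trims more aggressively.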

The best set of weights depends on the expected noise sources, how close to the microphone they
are, and various psychoacoustic factors. In practice, a good solution is to set the threshold so that (on
average) one or two microphones are trimmed away (M = 0.5 or M = 1). As $M \to N/2$, the robust
estimator approaches a median, which typically yields too much distortion.

While the above equations may be solvable numerically in the general case, some insight can be
gained analytically. A useful limit is where the incoherent noise $N_j(\omega)$ is small. Then, Equation (5),
which sets the trimming threshold $T$, is dominated by the term proportional to $s$, and the trimming
threshold $T$ is proportional to the mismatch between the signals presented to the robust estimator. For
free-space propagation, the strongest dependence of the energy functional $\beta$ on any adjustable
parameter (i.e., $w_j$ or $A_j(\omega)$) is through $T$, which leads to the intuitive result that it is best to match the
signals at the input to the robust estimator. This limit is found to be useful for a room de-reverberation
application.

Optimal Weights for Free-Space Propagation With Noise 

Working with free-space propagation, the optimal weights can be extracted. In that case,

$$H_j^d(\omega) = \frac{1}{d_j} e^{i\omega d_j/c} \qquad (7)$$

and

$$A_j(\omega) = 1/H_j^d(\omega). \qquad (8)$$

If the root-mean-square (RMS) noise voltage at each input to the robust estimator is almost the same, i.e.,

$$N_j^2 = \int |N_j(\omega) A_j(\omega)|^2 \, d\omega \approx N^2, \qquad (9)$$

then it can be shown that:

$$\beta \propto \sum_j w_j \sum_j w_j^2 N_j^2. \qquad (10)$$

Equation (1) simplifies dramatically because the transfer function times the gain is independent of
frequency. One of the factors $w_j$ comes from Equation (1) and the other factors $w_j^2 N_j^2$ come from
Equation (5). The weights that optimize the energy functional $\beta$ can be found analytically according to
Equation (11) as follows:

[Equation (11) not legible in source]   (11)

Numerical experiments confirm the exponent, and show that this relationship is valid to within 20% for
20 microphones and $0.3 < N_j/N < 3$. Therefore, under these assumptions, the optimal weights are a
function of distance from the source to the microphones, as given by Equation (12) as follows:

$$w_j^{\mathrm{opt}}(d_j) \propto d_j^{-3/2}. \qquad (12)$$
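
The distance dependence of Equation (12) is simple to apply in practice. In the sketch below, the function name `free_space_weights` is hypothetical, and the normalization of the weights to sum to one is an assumption made for this illustration; Equation (12) itself fixes only the proportionality.

```python
def free_space_weights(distances):
    """Weights proportional to distance**(-3/2), normalized to sum to one."""
    raw = [d ** -1.5 for d in distances]
    total = sum(raw)
    return [r / total for r in raw]

# A microphone four times farther from the source gets one eighth of the weight.
weights = free_space_weights([1.0, 4.0])
```

Closer microphones need less gain to match the desired signal, so they contribute less amplified noise and are weighted more heavily.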



Optimal Amplifier Response 

By taking a different limit, the optimal gain $A_j(\omega)$ can be calculated for a symmetrical
microphone array, where noises are equal. For simplicity, the noise and signals may be assumed to be
white. The transfer function is a direct path plus a single reflection, as given by Equation (13) as
follows:

$$H_j^d(\omega) = d_j^{-1} e^{i\omega d_j/c} \left( 1 + \alpha_j e^{i\omega\tau_j} \right), \qquad (13)$$

where $d_j$ is the distance of the microphone from the source, $\alpha_j$ is the echo strength (where
$|\alpha_j| \ll 1$ is assumed), and $\tau_j$ is the delay associated with the echo. Assuming that the delay matches
the echo, the amplifier gain $A_j$ can be parameterized according to Equation (14) as follows:

$$A_j(\omega) = d_j e^{-i\omega d_j/c} \left( 1 + \gamma_j e^{i\omega\tau_j} \right)^{-1}, \qquad (14)$$

where $\gamma_j$ is the amplifier's response function. How completely the amplifiers should cancel the echo can
be determined by finding the change to the amplifier's response function that will minimize the energy
functional $\beta$. Since this is a symmetric array, all of the distances are assumed identical.

The gain $A_j(\omega)$ can be calculated in the general case by decomposing the room impulse
response function into individual echoes, and calculating $\gamma$ for each $\alpha$.

The most interesting term in this problem becomes the squared trimming threshold $T^2$, which is
proportional to $\mathrm{var}\{v_j\}$ via Equation (5) as follows:

$$\left[ T/\mathrm{erf}^{-1}(1 - 2M/N) \right]^2 = \mathrm{var}\{v_j\} = N^2 (1 + \gamma^2) + S^2 (\alpha - \gamma)^2, \qquad (15)$$

neglecting higher-order terms in $\alpha$ and $\gamma$. For large signals, Equation (15) is dominated by the mismatch
between the amplifier response and the transfer function, while, for small signals, it is dominated by the
amplified noise.

The rest of the expression for the energy functional $\beta$ is independent of $S$ and $N$. For several
interesting limits, it can also be shown to be independent of $\alpha$ and $\gamma$. Specifically, if the probe pulse is
nearly Gaussian and has small autocorrelation at an interval of $\tau$, then:

[Equation (16) not legible in source]   (16)

is independent of $\alpha$ and $\gamma$. Minimizing the energy functional $\beta$ is then equivalent to minimizing
$\mathrm{var}\{v_j\}$, and the optimal value is given by Equation (17) as follows:

$$\gamma_{\mathrm{opt}} = \alpha S^2/(S^2 + N^2). \qquad (17)$$

In the more general case of non-white spectra, the optimal value is given by Equation (18) as follows:

$$\gamma_{\mathrm{opt}} = \alpha S^2/(S^2 + \eta^2 N^2), \qquad (18)$$

where $\eta$ is a function of the signal and noise spectral shapes, along with $\tau$.
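
The optimum of Equation (17) can be checked numerically against the variance expression of Equation (15). The sketch below uses hypothetical function names and example values; the parameter `eta` implements the scaling of Equation (18) and is left at 1 for the white-spectrum case.

```python
def residual_variance(gamma, alpha, S, N):
    """Right-hand side of Equation (15): amplified noise plus echo mismatch."""
    return N**2 * (1 + gamma**2) + S**2 * (alpha - gamma)**2

def gamma_opt(alpha, S, N, eta=1.0):
    """Equation (18); with eta = 1 this reduces to Equation (17)."""
    return alpha * S**2 / (S**2 + (eta * N)**2)

alpha, S, N = 0.3, 2.0, 1.0
analytic = gamma_opt(alpha, S, N)

# Brute-force search over a grid of candidate gamma values.
grid = [g / 1000 for g in range(-1000, 1001)]
numeric = min(grid, key=lambda g: residual_variance(g, alpha, S, N))
```

For strong signals (S much larger than N) the optimum approaches the echo strength, so the echo is cancelled almost completely; for weak signals it approaches zero, because cancelling the echo would mostly amplify noise.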

Equation (17) can be used to guide the choice of amplifier response function under more complex
conditions. To do this, the definition of the noise $N_j(\omega)$ needs analysis. The properties of the noise
that are relied on in subsequent derivations are just that it is uncorrelated with the signal, and
uncorrelated from one microphone to another. If the tail end of the transfer function of a reverberant
room is considered, it is easy to see that it can share the same properties. For many signals (e.g., speech
or music), the signal is non-stationary and changes every few hundred milliseconds. The reverberations
become uncorrelated with the signal coming on the direct path, because the speaker has gone on to a new
phoneme, while the listener still hears the reverberations of the previous phoneme. Likewise,
microphone-to-microphone correlations disappear in the tail of the reverberation, especially at high
frequencies, as each microphone sees a different sum of many randomly phased reflections from room
surfaces. Equation (18) can then be applied to the situation, interpreting N as the diffusely generated
noise plus the part of the room reverberation that is not cancelled out by the amplifiers.

With this model in mind, a good impulse response can be designed for the amplifiers, reflection 
by reflection. The process starts with the direct path, then applies Equation (18) to each image of the
source in turn. At some point, $\gamma_{opt}$ will become small, because the individual reflections are
exponentially diminishing in amplitude. At that point, the process stops, and all the power in the 
remaining reflections is treated as noise. In practice, the process may be limited first by changes in the 
room's transfer function, as sources and/or microphones move, or reflections off moving objects change. 
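The reflection-by-reflection procedure can be illustrated with a minimal sketch. It assumes that every reflection later than the current one is still uncancelled and therefore counts as noise. The function name, the `stop_below` threshold and the example amplitudes are hypothetical and are not taken from the specification.

```python
def design_reflection_gains(amplitudes, signal_power, diffuse_noise_power,
                            eta=1.0, stop_below=0.1):
    """Apply Equation (18) to successive source images until the gain is small.

    amplitudes          -- transfer-function amplitude (alpha) of each image,
                           direct path first (illustrative values)
    signal_power        -- S^2
    diffuse_noise_power -- N^2 of the diffusely generated noise
    eta                 -- spectral-shape factor of Equation (18)
    stop_below          -- hypothetical threshold at which the process stops
    """
    gains = []
    for k, alpha in enumerate(amplitudes):
        # Assumption: reflections later than k are not cancelled by the
        # amplifier, so their power is added to the diffuse noise power.
        residual_power = signal_power * sum(a * a for a in amplitudes[k + 1:])
        noise_power = diffuse_noise_power + residual_power
        gamma = alpha * signal_power / (signal_power + eta ** 2 * noise_power)
        if abs(gamma) < stop_below:
            break  # all remaining reflections are treated as noise
        gains.append(gamma)
    return gains


if __name__ == "__main__":
    # Exponentially diminishing reflections, as described in the text.
    print(design_reflection_gains([1.0, 0.5, 0.25, 0.125, 0.0625], 1.0, 0.1))
```

With these example values the loop keeps the first four images and stops at the fifth, whose gain falls below the chosen threshold.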

Perceptual Weighting 

In actuality, the model should be somewhat more complex than described above. The effect of 
the rogue probe pulse should be perceptually weighted in Equation (1), since larger intrusions can be 
tolerated at low and very high frequencies, and larger intrusions can be tolerated at frequencies and times 
where there is a lot of signal power. Adding the extra terms into the model will introduce a pre-emphasis 
filter 106 before the robust estimator 108, and a de-emphasis output filter 110 after. The pre-emphasis 
filter 106 will reduce the amplitude of perceptually unimportant noise (and thus reduce the trimming 
threshold by reducing the variance of the signals presented to the robust estimator). One implementation 
of filter 106 is to introduce a high-pass filter into amplifier 104, with a cutoff frequency of 50-100Hz. 
Such a filter can drastically reduce the trimming threshold, by eliminating low-frequency rumble such as 
that caused by ventilation systems. In addition to improving the system's ability to reject rogue signals,
removing the low-frequency rumble will reduce and possibly eliminate the intermodulation distortion 
products of the rumble, many of which could be at frequencies high enough to be annoying. 
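The effect of such a pre-emphasis stage can be sketched with a one-pole high-pass filter. This is only an illustration: the specification does not state the filter order or topology, and the 75-Hz cutoff and the helper names `tone` and `rms` are assumptions chosen for the example.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order (RC-style) high-pass filter; an assumed, simple topology."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    a = rc / (rc + dt)
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        # y[n] = a * (y[n-1] + x[n] - x[n-1])
        y = a * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def tone(freq_hz, sample_rate_hz, count):
    """Unit-amplitude sine wave used as a test signal."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


if __name__ == "__main__":
    rate = 12000
    for freq in (10, 1000):  # low-frequency rumble versus a mid-band tone
        x = tone(freq, rate, rate)
        y = high_pass(x, 75.0, rate)
        # Skip the first half second so the start-up transient has decayed.
        print(freq, "Hz gain:", rms(y[rate // 2:]) / rms(x[rate // 2:]))
```

Reducing the rumble amplitude in this way lowers the variance of the signals presented to the robust estimator, which is the mechanism the text relies on to reduce the trimming threshold.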

Experimental Procedure 

The processing of Fig. 1 was simulated to test its behavior. All tests were done by calculating
free-space sound propagation in a simulated room (a rectangular prism, extended with some added jitter
in reflection positions and coupling between modes to simulate bounces off furniture and other 
deviations from perfect box-like geometry). 

The simulated room was 7m x 3.5m x 3m high, with reverberation times from 100ms to 400ms. Five microphones were used, four spaced in a line, 0.8m apart, and one about 2.7m from the line. The microphones were from 0.56m to 2.7m from the sound source, and the overall arrangement was designed to represent a press conference, with four microphones for speakers, and one extra on the ceiling. A heavily trimmed mean was used, with N=5, M=1, allowing the highest and lowest signals to be trimmed off at the robust estimator before the mean is calculated. As indicated earlier, system performance should improve with more microphones. The simulations were performed with just five microphones to show that the technique can be useful with practical, inexpensive systems.
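The trimmed mean used in the simulations can be sketched as follows for a single sample instant. The sketch is unweighted, whereas the specification also describes per-microphone weights; the function name and the sample values are illustrative only.

```python
def trimmed_mean(values, m=1):
    """Mean of the inputs after discarding the m lowest and m highest values."""
    if 2 * m >= len(values):
        raise ValueError("cannot trim every input")
    kept = sorted(values)[m:len(values) - m]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Five estimator inputs at one instant; one microphone carries a rogue signal.
    inputs = [0.9, 1.0, 1.1, 1.0, 50.0]
    print("linear mean :", sum(inputs) / len(inputs))
    print("trimmed mean:", trimmed_mean(inputs, m=1))
```

With N=5 and M=1, three inputs survive the trimming, so a single rogue input cannot reach the output.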

A high-pass input filter 102 was placed after the microphones, with a 60-Hz cutoff frequency, to simulate removal of low-frequency ventilation system noise. The processing was implemented with a 12-kHz sampling rate and with the optimal weights $w_i \propto A^{-3/2}$ calculated using Equation (11) based on the assumption that the noise was equal at each microphone, where the amplifier gain $A$ was independent of frequency.



Simulation Results: Distortion on Focus 

In the first test, the nonlinearity of the system was measured by generating a tone burst with a Gaussian envelope ($\sigma$=188ms), then measuring the power at harmonics of the driving frequency, at the output of the system. The simulated room was lightly damped so the reverberation time was only 100ms, and no noise was introduced. Under these conditions, the largest harmonic was the third, down 35dB from the fundamental (median ratio, 70Hz - 1800Hz). Under more reverberant conditions (reverberation time of 400ms), the third harmonic was down by 28dB from the fundamental. The distortion would decrease as the number of microphones is increased.
Fig. 2 shows the dependence on frequency for the reverberant case. The two topmost curves show the power at the signal frequency for the linear and robust systems. The lower (dotted) curve shows the third-harmonic power for the robust system, and the points scattered near the lower curve display the third-harmonic power for the robust system at three other choices of source and focus position. Fig. 3 shows the dependence of the distortion on the length of the tone burst.

Distortion was also tested as a function of position, motivated by the observation that the distortion is proportional to $\mathrm{var}(v_i)$, and that the array was adjusted to have a small $\mathrm{var}(v_i)$ at the focus, and a
generally increasing variance as the source goes away from the focus. Fig. 4 shows the results of a test, 
where a tone burst source was scanned across the simulated room, and the system output was measured at 
the fundamental and at harmonics. Plotted is the average of tests at six frequencies between 300 Hz and 
1500 Hz. The third harmonic is the largest, and its median is 25dB below the on-focus signal. As 
expected, the fraction of power coming out in harmonics increases away from the focus, but that is 
loosely compensated by the reduction in total output power away from the focus, so that the power in the 
harmonics is roughly constant. 

Fig. 4 shows the expected reduction in distortion. Fig. 4 shows power in the fundamental and 
harmonics from a tone-burst source at different positions across a room. In Fig. 4, the linear microphone 
array is shown in the thick black curve, the fundamental frequency output of the robust estimator is 
shown in the thin black curve, and the third-harmonic output of the robust estimator is shown as black 
crosses. The source passes over one of the microphones at 1.25m, and passes through the array focus at 
2.5m. 

Simulation Results: Suppression of Rogue Signals 

A second test studied how well the system would suppress a signal from outside the focal 
volume. The simulated source was moved across a room with a 400-ms reverberation time while keeping 
the focus of the array fixed. The source produced a burst of band-limited Gaussian white noise (-3dB at
1kHz). Total energy was measured at the output of the system, waiting until the reverberations died 
away, and including any harmonic generation in the total. 

Ideally, a strong response is desired when the source is in the focal volume, and a much smaller 
response is desired to a source out of the focus. Fig. 5 shows results from this test for both a prior-art 
linear combination and a nonlinear robust signal estimation of the present invention. At d=2.5m, the 
source was centered in the focal volume, and, at d=1.29m, the source passes through one of the 
microphones. The linear system behaves very badly when the source is near the microphone. In 
particular, the power from the one close microphone gets so large that the amplitude of the output signal 
diverges, even though the source is well outside the focal volume. The nonlinear system, on the other 
hand, avoids this divergence by clipping away the signal from the one close microphone. 
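The divergence of the linear combination near a microphone can be illustrated with an amplitude-only sketch. It assumes 1/d spreading, gains equal to the focus distances, and no delays, phases or reflections. The microphone coordinates, the focus position and the function names are hypothetical and do not reproduce the simulated room.

```python
import math

# Hypothetical layout (metres): four microphones in a line, 0.8 m apart, plus one more.
MICS = [(0.0, 0.0), (0.8, 0.0), (1.6, 0.0), (2.4, 0.0), (1.2, 2.7)]
FOCUS = (1.2, 1.0)


def estimator_inputs(source, focus, mics):
    """Amplitude at each estimator input: focus-distance gain times 1/d spreading."""
    return [math.dist(focus, mic) / math.dist(source, mic) for mic in mics]


def linear_output(inputs):
    return sum(inputs) / len(inputs)


def robust_output(inputs, m=1):
    # Trimmed mean: drop the m lowest and m highest inputs before averaging.
    kept = sorted(inputs)[m:len(inputs) - m]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    for label, source in (("at focus", FOCUS), ("1 cm from a mic", (0.8, 0.01))):
        v = estimator_inputs(source, FOCUS, MICS)
        print(label, "linear:", linear_output(v), "robust:", robust_output(v))
```

At the focus every input equals one, so both estimators agree. One centimetre from a microphone, that microphone's input dominates the linear average, while the trimmed mean discards it.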
Right near the microphone, the system with the robust estimator can have a very large rejection of undesired signals, relative to the linear system. The robust estimator suppresses signals at 1cm by >10dB. Any noise source within 10cm of any microphone will be suppressed by at least 3dB. Sources close to unimportant microphones (e.g., those far from the focus, or those with a poor SNR) will be suppressed even more effectively and over a larger volume, since such microphones receive less weight in the robust combination operation.

Often (as seen in Fig. 5), the robust microphone array of the present invention behaves very much like the linear array, except near microphones. However, under reasonable conditions, it is possible for the robust microphone array to have improved rejection of rogue signals over a large volume of space, as shown in Fig. 6. Here, the robust system produces at least a 3dB better rejection ratio of rogue signals (relative to the focus) for d<1m, and produces 2dB better rejection for d>3m. The explanation for this improved rejection relates to the fact that the set of voltages feeding into the robust estimator module 108 at any given instant is not likely to be particularly Gaussian, even if each signal, individually, has a Gaussian amplitude distribution. It turns out that this distribution is particularly non-Gaussian away from the focus. The long-tailed nature of the probability distribution of values into the robust estimator allows it to preferentially trim off the largest inputs, and to do a better job of rejecting signals out of the focal volume.

A toy model can be developed that shows the effect by working with white, Gaussian signals, frequency-independent amplifier gain, and by neglecting reflections. In this model, the appropriate gains are given by Equation (19) as follows:

$$G_j(\omega) = d_j^{*}\, e^{-i\omega d_j^{*}/c}, \qquad (19)$$

where the superscript asterisk refers to the distances from the microphones to the focal point. The transfer function is given by Equation (20) as follows:

$$H_j(\omega) = \frac{1}{d_j}\, e^{i\omega d_j/c}, \qquad (20)$$

evaluated at the distance from the interfering source to the microphone.

At the focal volume, the amplifier delays are set to cancel the propagation delays, so the signals 
at each input to the robust estimator module are highly correlated, and actually identical in this model. 
The variance of the inputs is zero, and the output of any central estimator, robust or not, is equal to the 
average of the inputs. 
Almost everywhere away from the focus, where $d_j \neq d_j^{*}$, the amplifier delays do not match the propagation delay, and each input to the robust estimator module sees a statistically independent sample. The estimator inputs are then given by Equation (21) as follows:

[Equation (21) is missing from the source text.] (21)

where the random variables in Equation (21) are a set of independent, Gaussian random variables, with zero mean and variance proportional to the signal power. It may be assumed that their variance is unity without loss of generality.

The probability distribution of $\{v_j\}$ is then a mixture of several Gaussians according to Equation (22) as follows:

[Equation (22): the mixture-of-Gaussians expression for $P(v)$ is illegible in the source text.] (22)

which is therefore non-Gaussian unless all the $r_j$ are equal. In three-dimensional space, with three or more microphones, the only point that makes $P(v)$ strictly Gaussian is the focus. Elsewhere, some robust estimator will produce lower variance (and thus a lower output power) than the equivalent linear combination. If $P(v)$ is far enough from a Gaussian, then the system will give a noticeable suppression for rogue signals.

From the toy model, it can be seen that the largest effect will occur when one or more of the {$r_j$} differ strongly from unity. This happens most strongly when one of the {$r_j$} approaches zero. This is the 'expected' case, where the noise source is close to a microphone. However, it also happens when one of the {$r_j^{*}$} is small (i.e., when the focus is close to a microphone). In this latter, unexpected case, $P(v)$ can be noticeably non-Gaussian almost everywhere in the room, and the system can exhibit substantially better directivity than a linear system.
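The toy-model argument can be checked numerically with a small Monte Carlo sketch. It assumes independent zero-mean Gaussian inputs whose standard deviations differ between microphones, and it uses the median as the robust estimator. The scale values, trial count and function name are illustrative assumptions.

```python
import random
import statistics


def output_variances(scales, trials=20000, seed=0):
    """Variance of the linear mean and of the median over scaled Gaussian inputs."""
    rng = random.Random(seed)
    linear, robust = [], []
    for _ in range(trials):
        # One independent Gaussian sample per microphone, scaled per microphone.
        v = [s * rng.gauss(0.0, 1.0) for s in scales]
        linear.append(sum(v) / len(v))
        robust.append(statistics.median(v))
    return statistics.pvariance(linear), statistics.pvariance(robust)


if __name__ == "__main__":
    # One input with ten times the amplitude of the others, e.g. a rogue
    # source close to one microphone.
    lin, rob = output_variances([1.0, 1.0, 1.0, 1.0, 10.0])
    print("linear variance:", lin)
    print("median variance:", rob)
```

When one scale is much larger than the rest, the mixture is long-tailed and the median yields a markedly lower output variance than the equal-weight mean. When all scales are equal, the inputs are identically Gaussian and the linear mean is expected to do at least as well.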
Application: Room De-Reverberation

A room de-reverberation application applies the same core technique (use of a robust estimator to 
combine several microphone signals) in an iterative manner. In brief, the technique involves a 
microphone array focused on a desired signal source. Given an output signal, the digital filters on each 
microphone are adjusted to match all the microphone signals to that output signal. By matching all the 
microphone signals, the variance of the data going into the robust estimator is reduced, which will reduce 
the amount of distortion generated on the next pass. 

For this application, it is simpler to describe the algorithm as if all the data had been collected in 
advance, and stored data is being processed to find the optimal signal. Those skilled in the art can 
transform the description from an off-line post-processing system to an on-line system. One possible 
transformation to an on-line system is to assume that the room and source position change relatively 
slowly. The outputs from dynamic steering control 112 and dynamic estimation control 116 can then be 
calculated as time averages of quantities. One "pass" of the algorithm then corresponds roughly to the 
averaging time. The averaging time should be set long enough to get a sufficiently broad sample of the 
source signals, yet short enough so that the digital filters 104 and robust signal estimator 108 can be 
adapted to follow changes in the room acoustics. Alternatively, the entire system shown in Fig. 1 could 
be copied once for each pass, where the outputs of control modules 112 and 116 in the n-th pass could affect
the filters in the (n+1)-st pass. Multiple copies of the system are relatively easy for a software
implementation. 

Typically, after a few iterations, the algorithm converges to a solution where the generated 
distortion is low, and the output signal is close to the source signal. In cases where there are no noise 
sources, the algorithm will often converge to zero distortion, where the output is related to the source 
signal by a simple linear filter. 

A preferred implementation contains steps for heuristically generating an estimate of the source 
spectrum (Step 7), and using that estimate to match the spectrum of the output signal to the spectrum of 
the source (Step 8). Other estimates of the source spectrum are possible for Step 7. Likewise, Step 8 
generates a filter from knowledge of the power spectrum, without phase information. Should phase 
information be available, a person skilled in the art could use it to generate a better filter for Step 8. 

This preferred implementation comprises the following steps: 
Step 1: Read in the several microphone signals into $m_j(t)$ after correcting microphone frequency response with input filtering 102 of Fig. 1.

Step 2: Initialize FIR filters (i.e., 104 or equivalently $H_j(t)$) to align signals and to make their amplitudes match as well as possible.

Step 3: Filter the microphone signals with filters 104 and 106, according to Equation (23) as follows:

$$s_j(t) = m_j(t) \otimes H_j(t). \qquad (23)$$

The signals should be nearly equal and nearly time aligned at the end of this step.

Step 4: Apply the robust estimator 108 to get a single signal estimate, according to Equation (24) as follows:

$$q(t) = \mathrm{Robust}(\{s_j(t)\}) \qquad (24)$$

Step 5: Find the best linear FIR filters $h_j(t)$ (subject to length and other constraints), such that:

$$q(t) \approx m_j(t) \otimes h_j(t). \qquad (25)$$

This is the construction of a linear predictor from $m$ to $q$.

Step 6: Estimate the power spectrum $Q(\omega)$ of $q(t)$, via fast Fourier transform.

Step 7: Calculate a single, representative power spectrum for the source signal from the several microphone signals. Typically, one takes the median (at each frequency) of power spectra from the microphone signals, such that:

$$P(\omega) \leftarrow \mathrm{median}_j\, |\mathrm{FFT}(m_j)(\omega)|^2. \qquad (26)$$

Step 8: Construct a filter $f(t)$, whose transfer function (in the frequency domain) has magnitude $\sqrt{P(\omega)/Q(\omega)}$ (except where $Q$ is too small). One must be prepared to heuristically adjust $Q$ to make sure the denominator does not go near zero, but it rarely does, in practice. Typically, one constrains the length of the resulting filter in the time domain and/or trades off accuracy of the magnitude for a reduced norm of the filter.

Step 9: Construct updated filters for each channel $H'_j(t)$ via:

$$H'_j(t) = h_j(t) \otimes f(t). \qquad (27)$$

These filters fulfill two purposes. First, they make the microphone signals as close as possible to the output of the robust estimator (and therefore, they are also close to each other). Second, they match the overall output of the system to the estimate of the source's spectrum.

Step 10: Decide if the algorithm has converged well enough to stop, or whether it should update the filters and loop around again. The decision is based on how close $H'_j(t)$ is to $H_j(t)$, and/or how closely the microphone signals match, after processing through the two versions of the filter.

Step 11: If the algorithm needs more iterations, update $H_j(t)$. Typically, one would use:

$$H_j(t) \leftarrow \mu \cdot H_j(t) + (1-\mu) \cdot H'_j(t) \qquad (28)$$

with $-1 < \mu < 1$, but other updating schemes could also be derived.

When the algorithm converges, $q(t)$ is an estimate of the source signal, without room reverberations, and $H_j(t)$ are estimates of the room transfer function. Distortion levels can be very low, if $H_j(t)$ converges to something close to the real room transfer function.
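Three of the steps above (the median power spectrum of Step 7, the spectrum-matching magnitude of Step 8 and the relaxed update of Step 11) can be sketched as follows. This is a partial illustration only: the FIR fitting of Step 5 and the construction of a time-domain filter from the magnitude are omitted, a naive DFT replaces the FFT to keep the sketch self-contained, and the function names and the `floor` value used to protect the denominator are assumptions.

```python
import cmath
import math
from statistics import median


def power_spectrum(x):
    """Power spectrum via a naive DFT (stands in for an FFT in this sketch)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n)]


def source_spectrum(mic_signals):
    """Step 7: median, at each frequency, of the microphone power spectra."""
    spectra = [power_spectrum(m) for m in mic_signals]
    return [median(column) for column in zip(*spectra)]


def matching_magnitude(p, q, floor=1e-6):
    """Step 8: magnitude sqrt(P/Q), with Q kept away from zero (hypothetical floor)."""
    return [math.sqrt(pk / max(qk, floor)) for pk, qk in zip(p, q)]


def relax(h_old, h_new, mu):
    """Step 11: H <- mu * H + (1 - mu) * H', applied tap by tap."""
    return [mu * a + (1.0 - mu) * b for a, b in zip(h_old, h_new)]


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    mics = [[g * s for s in impulse] for g in (1.0, 2.0, 3.0)]
    p = source_spectrum(mics)       # median of the three power spectra
    q = power_spectrum(impulse)     # stands in for the robust output q(t)
    print(matching_magnitude(p, q))
    print(relax([1.0, 0.0], [0.0, 1.0], 0.5))
```

In the example the three microphone signals are scaled copies of one impulse, so the median power spectrum is that of the middle copy and the matching magnitude is the same at every frequency.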

Using a robust estimator according to the present invention (e.g., a trimmed mean or a median) to 
combine microphone signals can produce better directivity than a prior-art linear combination, when 
either a noise source or the focus is close to a microphone, with minimal degradation in other cases. The 
computational cost is low, and it does not make any assumptions about what the characteristics of either 
the noise or the signal are. For example, someone can tap his or her finger on any microphone in the 
array and hardly disturb the output. 

The present invention is computationally inexpensive, and does not require knowledge of the 
position of the noise source. It works on spread-out noise sources, so long as they are spread out over 
regions small compared to the array size. It also has the minor additional bonus of rejecting impulse 
noise at high frequencies, even from sources that are not near a microphone. 

The present invention may be implemented as circuit-based processes, including possible 
implementation on a single integrated circuit. As would be apparent to one skilled in the art, various 
functions of circuit elements may also be implemented in the digital domain as processing steps in a 
software program. Such software may be employed in, for example, a digital signal processor, micro- 
controller, or general-purpose computer. 

While the exemplary embodiments of the present invention have been described with respect to processes of circuits, including possible implementation as a single integrated circuit, the present invention is not so limited.

The present invention can be embodied in the form of methods and apparatuses for practicing those methods. The present invention can also be embodied in the form of program code embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. The present invention can also be embodied in the form of program code, for example, whether stored in a storage medium, loaded into and/or executed by a machine, or transmitted over some transmission medium or carrier, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.

Unless explicitly stated otherwise, each numerical value and range should be interpreted as being approximate as if the word "about" or "approximately" preceded the value of the value or range.

It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain the nature of this invention may be made by those skilled in the art without departing from the scope of the invention as expressed in the following claims.
